I have just learnt something new: Random Forest. It turns out Random Forest can do regression as well, so I'm going to compare Linear Regression and Random Forest Regression to see which one performs better.
I'm inspired by a blog post from Yhat, and I'm going to use the wine quality data from http://archive.ics.uci.edu/ml/datasets/Wine+Quality. Now let's look at our data.
In [70]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl  # pylab is deprecated; pyplot has the same plotting calls used below
from sklearn.model_selection import train_test_split  # moved here from the removed sklearn.cross_validation
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score
In [33]:
wine = pd.read_csv('winequality-red.csv', sep=';')
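If you don't have the file locally, pandas can also read it straight off the UCI server. A minimal sketch, assuming the file is still hosted at its usual path under machine-learning-databases:
In [ ]:
# Alternative: load the CSV directly from the UCI repository
# (the URL below is assumed to still be live)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine = pd.read_csv(url, sep=';')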
In [24]:
wine.shape
Out[24]:
(1599, 12)
In [25]:
wine.head()
Out[25]:
In [26]:
wine.plot(kind='scatter', x='fixed acidity', y='quality');
In [28]:
wine.plot(kind='scatter', x='pH', y='quality');
It looks like Linear Regression wouldn't work well here: neither scatter plot shows any clear correlation between the feature and quality.
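To put a number on that impression, here's a quick sketch using pandas' built-in Pearson correlation; values close to zero would confirm there's little linear signal for a Linear Regression to pick up:
In [ ]:
# Linear correlation of every column with quality, weakest first
wine.corr()['quality'].sort_values()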
In [44]:
train, test = train_test_split(wine, test_size = 0.2)
In [45]:
train.shape
Out[45]:
(1279, 12)
In [46]:
test.shape
Out[46]:
(320, 12)
In [64]:
cols = ['fixed acidity', 'density', 'pH']
rf = RandomForestRegressor(n_estimators=20)
rf.fit(train[cols], train.alcohol)
# Predict the alcohol content
predicted_alcohol = rf.predict(test[cols])
r2 = r2_score(test.alcohol, predicted_alcohol)
mse = np.mean((test.alcohol - predicted_alcohol)**2)  # mean squared error, kept for reference
# Plot predicted vs. actual alcohol, with the y = x line as the ideal fit
pl.scatter(test.alcohol, predicted_alcohol)
pl.plot(np.arange(8, 15), np.arange(8, 15), label="r^2=%.3f" % r2, c="r")
pl.legend(loc="lower right")
pl.title("RandomForest Regression with scikit-learn")
pl.show()
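Since the point was to compare the two models, here's a minimal Linear Regression baseline on the same three features and the same split. LinearRegression lives in sklearn.linear_model, which isn't imported above:
In [ ]:
from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares model on the same columns for comparison
lr = LinearRegression()
lr.fit(train[cols], train.alcohol)
predicted_lr = lr.predict(test[cols])
print("Linear Regression r^2: %.3f" % r2_score(test.alcohol, predicted_lr))
print("Random Forest r^2:     %.3f" % r2)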
In [79]:
cols = train.columns[:11]  # all 11 physicochemical features; the 12th column is quality
clf = RandomForestClassifier(n_estimators=20, max_features=10, min_samples_split=5)
clf.fit(train[cols], train.quality)
# Confusion matrix: actual quality vs. predicted quality
pd.crosstab(test.quality, clf.predict(test[cols]), rownames=["Actual"], colnames=["Pred"])
Out[79]:
In [80]:
# Overall classification accuracy on the test set
np.sum(test.quality == clf.predict(test[cols])) / float(len(test))
Out[80]:
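A nice bonus of Random Forests is that the fitted model reports how much it relied on each input. A short sketch using the classifier's feature_importances_ attribute (sort_values assumes a reasonably recent pandas):
In [ ]:
# Rank the 11 input features by their importance in the fitted forest
pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)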